Statistical learning: moving beyond linearity

MACS 30500
University of Chicago

February 22, 2017

Linearity in linear models

  • Linear assumption
  • Why this assumption can fail
  • When to relax the assumption

Linearity of the data

\[Y = 2 + 3X + \epsilon\]

\(\epsilon\) is random error distributed normally \(N(0,3)\)
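This model is easy to verify by simulation. A minimal NumPy sketch (the course materials use R; this Python version is illustrative, and takes the 3 in \(N(0,3)\) as the standard deviation):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
x = rng.uniform(0, 10, n)
eps = rng.normal(0, 3, n)        # epsilon ~ N(0, 3), taking 3 as the sd
y = 2 + 3 * x + eps

# OLS on the design matrix [1, x] recovers beta_0 = 2, beta_1 = 3
X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta)   # approximately [2, 3]
```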

Non-linearity

\[Y = 2 + 3X + 2X^2 + \epsilon\]
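Simulating from the quadratic model shows why a straight-line fit fails here. An illustrative NumPy sketch:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
x = rng.uniform(-3, 3, n)
y = 2 + 3 * x + 2 * x**2 + rng.normal(0, 1, n)

# A straight-line fit misses the curvature; adding an x^2 column captures it
lin = np.column_stack([np.ones(n), x])
quad = np.column_stack([np.ones(n), x, x**2])
b_lin, *_ = np.linalg.lstsq(lin, y, rcond=None)
b_quad, *_ = np.linalg.lstsq(quad, y, rcond=None)
mse_lin = np.mean((y - lin @ b_lin) ** 2)
mse_quad = np.mean((y - quad @ b_quad) ** 2)
print(round(mse_lin, 1), round(mse_quad, 1))   # quadratic MSE is far smaller
```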

Non-constant variance of the error terms

\[\text{Var}(\epsilon_i) = \sigma^2\]

  • Homoscedasticity
  • Heteroscedasticity

Homoscedastic

\[Y = 2 + 3X + \epsilon\]

\(\epsilon\) is random error distributed normally \(N(0,1)\)

Heteroscedastic

\[Y = 2 + 3X + \epsilon\]

\(\epsilon\) is random error distributed normally \(N(0,\frac{X}{2})\).
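A simulation makes the contrast visible, reading \(\frac{X}{2}\) as the standard deviation of the error. An illustrative NumPy sketch:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 2000
x = rng.uniform(0.1, 10, n)
# Error sd grows with x, so the homoscedasticity assumption fails
y = 2 + 3 * x + rng.normal(0, x / 2)

X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

# Residual spread is visibly larger for large x
lo, hi = resid[x < 5].std(), resid[x >= 5].std()
print(round(lo, 2), round(hi, 2))
```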

How to relax the assumption

  • Monotonic transformations
  • Step functions
  • Splines
  • Local regression
  • Generalized additive models (GAMs)

Ladder of powers

Transformation   Power             \(f(X)\)
Cube             3                 \(X^3\)
Square           2                 \(X^2\)
Identity         1                 \(X\)
Square root      \(\frac{1}{2}\)   \(\sqrt{X}\)
Cube root        \(\frac{1}{3}\)   \(\sqrt[3]{X}\)
Log              0 (sort of)       \(\ln(X)\)
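The practical effect of moving down the ladder is to pull in a long right tail. An illustrative NumPy sketch on simulated right-skewed data:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.lognormal(mean=0, sigma=1, size=5000)    # right-skewed, strictly positive

def skewness(v):
    c = v - v.mean()
    return np.mean(c**3) / np.mean(c**2) ** 1.5

# Moving down the ladder (sqrt, cube root, log) pulls in the long right tail
for name, f in [("identity", lambda v: v), ("sqrt", np.sqrt),
                ("cube root", np.cbrt), ("log", np.log)]:
    print(f"{name}: {skewness(f(x)):.2f}")
```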

Which transformation should I use?

Log transformations

  • One-sided transformation of \(Y\)

    \[\ln(Y_i) = \beta_0 + \beta_{1}X_i + \epsilon_i\]

    \[E(Y) = e^{\beta_0 + \beta_{1}X_i}\]

    \[\frac{\partial E(Y)}{\partial X} = \beta_1 e^{\beta_0 + \beta_{1}X_i} = \beta_1 E(Y)\]

  • One-sided transformation of \(X\)

    \[Y_i = \beta_0 + \beta_{1} \ln(X_i) + \epsilon_i\]
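Both one-sided transformations amount to running OLS after taking logs. A simulated example of the log-\(Y\) case (the coefficients here are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 1000
x = rng.uniform(0, 5, n)
y = np.exp(0.5 + 0.3 * x + rng.normal(0, 0.2, n))   # Y is log-linear in X

# Regress ln(Y) on X; exp(beta_1) is the multiplicative effect of a
# one-unit change in X on the expected value of Y
X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, np.log(y), rcond=None)
print(round(beta[1], 2), round(np.exp(beta[1]), 2))
```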

Log-log regressions

\[\ln(Y_i) = \beta_0 + \beta_{1} \ln(X_i) + \dots + \epsilon_i\]

  • Elasticity

    \[\text{Elasticity}_{YX} = \frac{\% \Delta Y}{\% \Delta X}\]

  • A direct means of interpreting a nonlinear effect
  • A double multiplicative relationship
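The elasticity interpretation can be checked by simulation: generate data with a known constant elasticity and recover it from the log-log slope. An illustrative NumPy sketch:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 1000
x = rng.uniform(1, 10, n)
y = 2 * x**0.7 * np.exp(rng.normal(0, 0.1, n))   # constant-elasticity relationship

# In the log-log regression the slope is the elasticity:
# a 1% change in X is associated with a beta_1 % change in Y
X = np.column_stack([np.ones(n), np.log(x)])
beta, *_ = np.linalg.lstsq(X, np.log(y), rcond=None)
print(round(beta[1], 2))   # recovers the elasticity of 0.7
```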

Polynomial regressions

\[y_i = \beta_0 + \beta_{1}x_{i} + \epsilon_{i}\]

\[y_i = \beta_0 + \beta_{1}x_{i} + \beta_{2}x_i^2 + \beta_{3}x_i^3 + \dots + \beta_{d}x_i^d + \epsilon_i\]
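Fitting a polynomial regression is still ordinary least squares, just on a design matrix whose columns are powers of \(x\). An illustrative NumPy sketch with a made-up cubic:

```python
import numpy as np

rng = np.random.default_rng(6)
n = 500
x = rng.uniform(-2, 2, n)
y = 1 - x + 0.5 * x**3 + rng.normal(0, 0.3, n)

d = 3                                        # degree of the polynomial
X = np.vander(x, d + 1, increasing=True)     # columns [1, x, x^2, x^3]
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(beta, 2))                     # approximately [1, -1, 0, 0.5]
```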

Biden and age

\[\text{Biden}_i = \beta_0 + \beta_1 \text{Age} + \beta_2 \text{Age}^2 + \beta_3 \text{Age}^3 + \beta_4 \text{Age}^4 + \epsilon_i\]

Variance-covariance matrix

Variance-covariance matrix of Biden polynomial regression
            (Intercept)   I(age^1)   I(age^2)   I(age^3)   I(age^4)
(Intercept)   620.00316  -56.31558    1.76432   -0.02291    0.00011
I(age^1)      -56.31558    5.20765   -0.16556    0.00218   -0.00001
I(age^2)        1.76432   -0.16556    0.00533   -0.00007    0.00000
I(age^3)       -0.02291    0.00218   -0.00007    0.00000    0.00000
I(age^4)        0.00011   -0.00001    0.00000    0.00000    0.00000

Pointwise standard errors

\[\hat{f}(x_0) = \hat{\beta}_0 + \hat{\beta}_1 x_{0} + \hat{\beta}_2 x_{0}^2 + \hat{\beta}_3 x_{0}^3 + \hat{\beta}_4 x_{0}^4\]

\[\text{Var}(\hat{f}(x_0))\]
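Both quantities can be computed directly: the variance-covariance matrix of the coefficients is \(\hat{\sigma}^2 (X'X)^{-1}\), and the pointwise variance at \(x_0\) is the quadratic form \(\ell_0' \hat{\text{Var}}(\hat{\beta}) \ell_0\) with \(\ell_0 = (1, x_0, x_0^2, x_0^3, x_0^4)\). An illustrative NumPy sketch on simulated data:

```python
import numpy as np

rng = np.random.default_rng(8)
n = 300
x = rng.uniform(-1, 1, n)
y = 1 + x - 2 * x**2 + rng.normal(0, 0.5, n)

# Quartic fit; vcov is the estimated variance-covariance matrix
# sigma^2 (X'X)^(-1) of the coefficient estimates
X = np.vander(x, 5, increasing=True)
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
sigma2 = np.sum((y - X @ beta) ** 2) / (n - 5)
vcov = sigma2 * np.linalg.inv(X.T @ X)

# Pointwise variance at x0 is l0' vcov l0 with l0 = (1, x0, ..., x0^4)
x0 = 0.5
l0 = x0 ** np.arange(5)
se_f = np.sqrt(l0 @ vcov @ l0)
print(round(se_f, 3))
```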

Voter turnout and mental health

\[\Pr(\text{Voter turnout} = \text{Yes} | \text{mhealth}) = \frac{\exp[\beta_0 + \beta_1 \text{mhealth} + \beta_2 \text{mhealth}^2 + \beta_3 \text{mhealth}^3 + \beta_4 \text{mhealth}^4]}{1 + \exp[\beta_0 + \beta_1 \text{mhealth} + \beta_2 \text{mhealth}^2 + \beta_3 \text{mhealth}^3 + \beta_4 \text{mhealth}^4]}\]
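A logistic regression with polynomial terms can be fit by Newton-Raphson (iteratively reweighted least squares). A sketch on simulated data; the mental health index and coefficients here are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(9)
n = 2000
mhealth = rng.uniform(0, 9, n)       # hypothetical 0-9 mental health index
p_true = 1 / (1 + np.exp(-(1.5 - 0.4 * mhealth)))   # turnout falls with mhealth
vote = rng.binomial(1, p_true)

# Quartic logistic regression fit by Newton-Raphson
X = np.vander(mhealth / 9, 5, increasing=True)      # rescale so powers stay small
beta = np.zeros(5)
for _ in range(25):
    mu = 1 / (1 + np.exp(-X @ beta))
    W = mu * (1 - mu)
    beta += np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (vote - mu))

# Fitted turnout probabilities at the best and worst mental health scores
u = np.array([0.0, 1.0])
p_hat = 1 / (1 + np.exp(-np.vander(u, 5, increasing=True) @ beta))
print(p_hat)
```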

Step functions

  • Global structure
  • Local structure
  • Step functions
  • Binning

    \[y_i = \beta_0 + \beta_1 C_1 (x_i) + \beta_2 C_2 (x_i) + \dots + \beta_K C_K (x_i) + \epsilon_i\]
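Here each \(C_k(x_i)\) is an indicator for \(x_i\) falling in the \(k\)th bin, so the fitted value within a bin is simply the mean of \(y\) there. An illustrative NumPy sketch:

```python
import numpy as np

rng = np.random.default_rng(10)
n = 1000
x = rng.uniform(0, 10, n)
y = np.sin(x) + rng.normal(0, 0.3, n)

# Cut x into 5 equal-width bins; regressing y on the bin indicators
# C_1(x), ..., C_K(x) returns the mean of y within each bin
cuts = np.linspace(0, 10, 6)
bins = np.digitize(x, cuts[1:-1])        # bin index 0..4 for each point
step_fit = np.array([y[bins == k].mean() for k in range(5)])
print(step_fit)                          # one fitted constant per bin
```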

Biden and age

Regression splines

  • Extend monotonic transformations and piecewise constant regression by fitting separate polynomial functions over different regions of \(X\)

Piecewise polynomials

\[y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \beta_3 x_i^3 + \epsilon_i\]

  • Piecewise cubic polynomial with 0 knots

    \[y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \beta_3 x_i^3 + \epsilon_i\]

  • Piecewise constant polynomial (degree \(0\))
  • Piecewise cubic polynomial with 1 knot

    \[y_i = \begin{cases} \beta_{01} + \beta_{11}x_i + \beta_{21}x_i^2 + \beta_{31}x_i^3 + \epsilon_i & \text{if } x_i < c \\ \beta_{02} + \beta_{12}x_i + \beta_{22}x_i^2 + \beta_{32}x_i^3 + \epsilon_i & \text{if } x_i \geq c \end{cases}\]
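With one knot and no continuity constraints, this is just two separate cubic fits. An illustrative NumPy sketch on simulated data with a kink at the knot:

```python
import numpy as np

rng = np.random.default_rng(11)
n = 800
x = rng.uniform(0, 10, n)
y = np.where(x < 5, x, 10 - x) + rng.normal(0, 0.3, n)   # kink at x = 5

c = 5.0   # single knot

def cubic_fit(xs, ys):
    # Fit beta_0 + beta_1 x + beta_2 x^2 + beta_3 x^3 by least squares
    B = np.vander(xs, 4, increasing=True)
    beta, *_ = np.linalg.lstsq(B, ys, rcond=None)
    return beta

# Separate cubic polynomials on each side of the knot
left = cubic_fit(x[x < c], y[x < c])
right = cubic_fit(x[x >= c], y[x >= c])

# Each piece tracks its own side of the kink
pred = lambda b, x0: b @ x0 ** np.arange(4)
print(pred(left, 1.0), pred(right, 9.0))   # both near 1
```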

Constraints and splines

Choosing the number and location of knots

\(10\)-fold CV MSE
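A sketch of choosing the number of (evenly spaced) knots by \(10\)-fold cross-validation, using a cubic truncated power basis on simulated data:

```python
import numpy as np

rng = np.random.default_rng(12)
n = 1000
x = rng.uniform(0, 10, n)
y = np.sin(x) + rng.normal(0, 0.3, n)

def spline_basis(xs, knots):
    # Cubic spline via the truncated power basis: 1, x, x^2, x^3, (x - k)_+^3
    cols = [xs**p for p in range(4)]
    cols += [np.clip(xs - k, 0, None) ** 3 for k in knots]
    return np.column_stack(cols)

def cv_mse(n_knots, folds=10):
    knots = np.linspace(0, 10, n_knots + 2)[1:-1]    # interior knots
    fold_id = np.arange(n) % folds
    errs = []
    for f in range(folds):
        tr, te = fold_id != f, fold_id == f
        beta, *_ = np.linalg.lstsq(spline_basis(x[tr], knots), y[tr], rcond=None)
        errs.append(np.mean((y[te] - spline_basis(x[te], knots) @ beta) ** 2))
    return np.mean(errs)

# Compare 0 through 10 knots; pick the number with the lowest CV MSE
scores = {k: cv_mse(k) for k in range(11)}
best = min(scores, key=scores.get)
print(best, round(scores[best], 3))
```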

Optimal model

Local regression

  1. Gather the fraction \(s = \frac{k}{n}\) of training points whose \(x_i\) are closest to \(x_0\).
  2. Assign a weight \(K_{i0} = K(x_i, x_0)\) to each point in the neighborhood, so that the point furthest from \(x_0\) has a weight of 0, and the closest has the highest weight. All but these \(k\) nearest neighbors get a weight of 0.
  3. Fit a weighted least squares regression of the \(y_i\) on the \(x_i\) using the aforementioned weights.
  4. The fitted value at \(x_0\) is given by \(\hat{f}(x_0) = \hat{\beta}_0 + \hat{\beta}_1 x_0\)
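The four steps above can be sketched directly (tricube kernel, local linear fit); the span and simulated data are illustrative:

```python
import numpy as np

rng = np.random.default_rng(13)
n = 500
x = rng.uniform(0, 10, n)
y = np.sin(x) + rng.normal(0, 0.3, n)

def local_fit(x0, span=0.2):
    k = int(span * n)                       # number of nearest neighbors
    d = np.abs(x - x0)
    nearest = np.argsort(d)[:k]
    dmax = d[nearest].max()
    w = np.zeros(n)                         # weight 0 outside the neighborhood
    w[nearest] = (1 - (d[nearest] / dmax) ** 3) ** 3   # tricube weights
    X = np.column_stack([np.ones(n), x])
    # Weighted least squares: solve (X'WX) b = X'Wy
    b = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y))
    return b[0] + b[1] * x0

print(local_fit(2.0))   # close to sin(2)
```

Note that the furthest of the \(k\) neighbors gets weight exactly 0, as in step 2.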

Generalized additive models

  • Combine multiple predictors
  • Maintain additive assumption

GAMs for regression problems

\[y_i = \beta_0 + \beta_{1} X_{i1} + \beta_{2} X_{i2} + \dots + \beta_{p} X_{ip} + \epsilon_i\]

\[y_i = \beta_0 + \sum_{j = 1}^p f_j(x_{ij}) + \epsilon_i\]

\[y_i = \beta_0 + f_1(x_{i1}) + f_2(x_{i2}) + \dots + f_p(x_{ip}) + \epsilon_i\]
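GAMs of this form can be estimated by backfitting: cycle through the predictors, smoothing the partial residuals against each one until the \(f_j\) stabilize. An illustrative sketch on simulated data, with a crude bin smoother standing in for a spline:

```python
import numpy as np

rng = np.random.default_rng(14)
n = 1500
x1 = rng.uniform(0, 10, n)
x2 = rng.uniform(0, 10, n)
y = np.sin(x1) + 0.1 * (x2 - 5) ** 2 + rng.normal(0, 0.3, n)

def smooth(xs, r, bins=25):
    # Crude bin smoother: mean of the residuals r within each bin of xs
    edges = np.linspace(0, 10, bins + 1)
    idx = np.clip(np.digitize(xs, edges[1:-1]), 0, bins - 1)
    means = np.array([r[idx == b].mean() if np.any(idx == b) else 0.0
                      for b in range(bins)])
    return means[idx]

# Backfitting: repeatedly fit each f_j to the partial residuals
beta0 = y.mean()
f1 = np.zeros(n)
f2 = np.zeros(n)
for _ in range(20):
    f1 = smooth(x1, y - beta0 - f2)
    f1 -= f1.mean()                          # center each component
    f2 = smooth(x2, y - beta0 - f1)
    f2 -= f2.mean()

resid = y - beta0 - f1 - f2
print(round(resid.std(), 2))   # close to the noise sd of 0.3
```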

GAM for Biden

\[\text{Biden} = \beta_0 + f_1(\text{Age}) + f_2(\text{Education}) + f_3(\text{Gender}) + \epsilon\]

GAM for Titanic